Code to clean the data file-by-file

Importing the necessary libraries

In [1]:
import pandas as pd
import csv
import string
import re
import nltk

nltk.download('stopwords')
nltk.download('names')
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk import word_tokenize
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

%matplotlib inline
pd.set_option('display.max_colwidth', 150)

(A) Read the CSV File

In [3]:
df = pd.read_csv("C:\\Users\\Aruna\\Documents\\input\\Amazon CloudFront.csv")

df['description'] = df['description'].apply(lambda x: " ".join(x for x in str(x).split())) # cast to str and collapse runs of whitespace
 
df.head(10)
Out[3]:
id label description
0 5829.0 Amazon CloudFront CloudFront Chunked Encoding and Resumable Downloads S3 supports this. Does CloudFront? If not, is this something that is coming soon? Thanks!
1 5829.0 Amazon CloudFront Hi keith4pluralsight,CloudFront supports resumable downloads, but not chunked encoding. At this time we have no plans to add support for chunked e...
2 5829.0 Amazon CloudFront Chunked encoding would be especially usedul for Lambda@Edge as low latency is critical and we want to flush the HTML to the client as soon as poss...
3 5828.0 Amazon CloudFront Debug 403 error when accessing S3 resources through CloudFront Is there a simple means for debugging a CloudFront configuration that is supposed t...
4 5828.0 Amazon CloudFront Have you tried using AWS CloudTrail? It contains all sorts of cloud events. You can search for events by username, access key, bucket name, etc. I...
5 5827.0 Amazon CloudFront Feature request: Custom headers (e.g. set HSTS, CSP, X-Frame-Options...) I'd love to see an ability to add custom headers inside CloudFront, e.g.:...
6 5827.0 Amazon CloudFront I agree. To prevent many of the hacking attempts going on today, it is important to support these headers. I would love to be able to set the foll...
7 5827.0 Amazon CloudFront Yes I just ran owasp zap against my site and the security warnings were all about the lack of secure headers on assets being served from cloudfron...
8 5827.0 Amazon CloudFront +1 for all the headers listed. It is fun to get security researcher reports and you cannot fix this issue because the required headers are not sup...
9 5827.0 Amazon CloudFront +1 for X-Frame-Options. I too ran OWASP ZAP and got warnings.
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19189 entries, 0 to 19188
Data columns (total 3 columns):
id             19188 non-null float64
label          19189 non-null object
description    19189 non-null object
dtypes: float64(1), object(2)
memory usage: 449.8+ KB

Check out one sample post:

In [5]:
p = 5

df['description'][p]
Out[5]:
"Feature request: Custom headers (e.g. set HSTS, CSP, X-Frame-Options...) I'd love to see an ability to add custom headers inside CloudFront, e.g.: - Strict-Transport-Security - Content-Security-Policy - X-Frame-Options etc. We serve a lot of content from S3, where we can't set those headers (for good reason, since that content can be served by https://s3.amazonaws.com/BUCKET/key). I'd be great to be able to add those headers inside CloudFront -- when the origin itself can't or doesn't set them."

Top 30 words + frequency of each:

In [6]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[6]:
the           66012
to            49656
I             27906
a             27772
and           22943
is            21610
for           17510
in            16773
that          16238
of            15579
you           15346
on            12817
this          12458
it            12259
CloudFront    11646
ms            11574
have          11098
from          11075
with          10629
be             9714
not            9274
your           9026
are            8618
can            8221
my             7431
as             7328
but            6552
an             6275
or             5967
-              5732
dtype: int64
In [7]:
print("There are a total of", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words before cleaning.")
There are a total of 1689274 words before cleaning.

(B) Text Pre-processing

In [8]:
STOPWORDS = stopwords.words('english')
my_stop_words = ["hi", "hello", "regards", "thank", "thanks", "regard", "best", "wishes", "hey", "amazon", "aws", "s3",
"elastic", "beanstalk", "rds", "ec2", "lambda", "cloudfront", "cloud", "front", "vpc", "sns", "me",
"january", "february", "march", "april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov",
"dec", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue",
"wed", "thu", "fri", "sat", "sun", "ain't", "aren't", "can't", "can't've", "'cause", "could've", "couldn't",
"couldn't've", "didn't", "doesn't", "don't", "hadn't", "hadn't've", "hasn't", "haven't", "he'd", "he'd've",
"he'll", "he'll've", "he's", "how'd", "how'd'y", "how'll", "how's", "i'd", "i'd've", "i'll", "i'll've", "i'm",
"i've", "isn't", "it'd", "it'd've", "it'll", "it'll've", "it's", "let's", "mayn't", "might've", "mightn't",
"mightn't've", "must've", "mustn't", "mustn't've", "needn't", "needn't've", "oughtn't", "oughtn't've", "shan't",
"sha'n't", "shan't've", "she'd", "she'd've", "she'll", "she'll've", "she's", "should've", "shouldn't", "shouldn't've",
"so've", "so's", "that'd", "that'd've", "that's", "there'd", "there'd've", "there's", "they'd", "they'd've", "they'll",
"they'll've", "they're", "they've", "to've", "wasn't", "we'd", "we'd've", "we'll", "we'll've", "we're", "we've",
"weren't", "what'll", "what'll've", "what're", "what's", "what've", "when's", "when've", "where'd", "where's",
"where've", "who'll", "who'll've", "who's", "who've", "why's", "why've", "will've", "won't", "won't've", "would've",
"wouldn't", "wouldn't've", "yall", "yalld", "yalldve", "yallre", "yallve", "youd", "youdve", "youll",
"youllve", "youre", "youve", "do", "did", "does", "had", "have", "has", "could", "can", "as", "is",
"shall", "should", "would", "will", "you", "me", "please", "know", "who", "we", "was", "were", "edited", "by", "pm"]

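name = names.words()
STOPWORDS.extend(my_stop_words)
STOPWORDS.extend(name)

# Raw strings keep the backslash escapes intact for the regex engine
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,:;#+?]')
# Keeps digits, lowercase letters, spaces, underscores and dots;
# everything else (hyphens and apostrophes included) is removed
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

# Strip the same symbols from the stop words so they match the cleaned text ("can't" -> "cant").
# Note: names.words() entries are capitalized, so this also drops their capital letters ('Aaron' -> 'aron').
# A set makes the membership test in the stop-word step fast.
STOPWORDS = {BAD_SYMBOLS_RE.sub('', x) for x in STOPWORDS}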
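
A small check of what the four patterns do to one string. The patterns are equivalent to those defined above and are repeated so the block runs on its own. They are applied to the whole string here for brevity, whereas the cells below apply them word by word. The sample string, the placeholder URL and the helper `squeeze` are assumptions for illustration.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,:;#+?]')
# A class written as [^0-9a-z - _.] behaves the same: " - " is a space-to-space range,
# so the hyphen is not among the kept characters.
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

def squeeze(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

# Made-up sample; example.com is a placeholder URL
sample = "x-frame-options (e.g. can't) <b>see</b> https://example.com/a now"

no_html = squeeze(REMOVE_HTML_RE.sub(' ', sample))
no_urls = squeeze(REMOVE_HTTP_RE.sub(' ', no_html))
spaced = squeeze(REPLACE_BY_SPACE_RE.sub(' ', no_urls))
cleaned = squeeze(BAD_SYMBOLS_RE.sub('', spaced))

for step in (no_html, no_urls, spaced, cleaned):
    print(step)
```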

Convert to lowercase

In [9]:
df['description'] = df['description'].apply(lambda x: " ".join(x.lower() for x in str(x).split(" ")))

df['description'][p]
Out[9]:
"feature request: custom headers (e.g. set hsts, csp, x-frame-options...) i'd love to see an ability to add custom headers inside cloudfront, e.g.: - strict-transport-security - content-security-policy - x-frame-options etc. we serve a lot of content from s3, where we can't set those headers (for good reason, since that content can be served by https://s3.amazonaws.com/bucket/key). i'd be great to be able to add those headers inside cloudfront -- when the origin itself can't or doesn't set them."

Remove all HTML tags

In [10]:
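# Note: the pattern is applied to each whitespace-separated word, so only tags without spaces (e.g. <br>, </p>) are matched
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTML_RE.sub(' ', x) for x in str(x).split()))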

df['description'][p]
Out[10]:
"feature request: custom headers (e.g. set hsts, csp, x-frame-options...) i'd love to see an ability to add custom headers inside cloudfront, e.g.: - strict-transport-security - content-security-policy - x-frame-options etc. we serve a lot of content from s3, where we can't set those headers (for good reason, since that content can be served by https://s3.amazonaws.com/bucket/key). i'd be great to be able to add those headers inside cloudfront -- when the origin itself can't or doesn't set them."

Remove all URLs

In [11]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTTP_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[11]:
"feature request: custom headers (e.g. set hsts, csp, x-frame-options...) i'd love to see an ability to add custom headers inside cloudfront, e.g.: - strict-transport-security - content-security-policy - x-frame-options etc. we serve a lot of content from s3, where we can't set those headers (for good reason, since that content can be served by   i'd be great to be able to add those headers inside cloudfront -- when the origin itself can't or doesn't set them."

Replace certain characters with a space (slashes, brackets, parentheses, commas etc.)

In [12]:
df['description'] = df['description'].apply(lambda x: " ".join(REPLACE_BY_SPACE_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[12]:
"feature request  custom headers  e.g. set hsts  csp  x-frame-options...  i'd love to see an ability to add custom headers inside cloudfront  e.g.  - strict-transport-security - content-security-policy - x-frame-options etc. we serve a lot of content from s3  where we can't set those headers  for good reason  since that content can be served by i'd be great to be able to add those headers inside cloudfront -- when the origin itself can't or doesn't set them."

Remove any unwanted symbols (like $, %, apostrophes and hyphens)

In [13]:
df['description'] = df['description'].apply(lambda x: " ".join(BAD_SYMBOLS_RE.sub('', x) for x in str(x).split()))

df['description'][p]
Out[13]:
'feature request custom headers e.g. set hsts csp xframeoptions... id love to see an ability to add custom headers inside cloudfront e.g.  stricttransportsecurity  contentsecuritypolicy  xframeoptions etc. we serve a lot of content from s3 where we cant set those headers for good reason since that content can be served by id be great to be able to add those headers inside cloudfront  when the origin itself cant or doesnt set them.'

Remove leading and trailing dots, hyphens and underscores from each word

In [14]:
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('.') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('-') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('_') for x in x.split()))
df['description'][p]
Out[14]:
'feature request custom headers e.g set hsts csp xframeoptions id love to see an ability to add custom headers inside cloudfront e.g stricttransportsecurity contentsecuritypolicy xframeoptions etc we serve a lot of content from s3 where we cant set those headers for good reason since that content can be served by id be great to be able to add those headers inside cloudfront when the origin itself cant or doesnt set them'

Remove any words made up entirely of digits

In [15]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if not x.isdigit()))

df['description'][p]
Out[15]:
'feature request custom headers e.g set hsts csp xframeoptions id love to see an ability to add custom headers inside cloudfront e.g stricttransportsecurity contentsecuritypolicy xframeoptions etc we serve a lot of content from s3 where we cant set those headers for good reason since that content can be served by id be great to be able to add those headers inside cloudfront when the origin itself cant or doesnt set them'
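
`str.isdigit()` is true only when every character of a word is a digit, so mixed words such as `s3` and decimals such as `1.5` pass through this step unchanged. A minimal sketch of that behaviour, where the helper name and the sample string are made up:

```python
def drop_digit_tokens(text):
    """Drop words that consist only of digits; keep everything else."""
    return " ".join(tok for tok in text.split() if not tok.isdigit())

print(drop_digit_tokens("error 403 on s3 after 1.5 hours"))
```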

Remove the stop words and single-character words

In [16]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if x not in STOPWORDS
                                                               and len(x) > 1))

df['description'][p]
Out[16]:
'feature request custom headers e.g set hsts csp xframeoptions love see ability custom headers inside e.g stricttransportsecurity contentsecuritypolicy xframeoptions etc serve lot content set headers good reason since content served great headers inside origin set'
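
The steps above can be gathered into one function, which makes it easier to apply the same cleaning to each input file. This is a sketch under assumptions: the name `clean_text`, the tiny `DEMO_STOPWORDS` set and the sample sentence are illustrative, and the notebook itself uses NLTK's lists plus the custom words.

```python
import re

REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,:;#+?]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z _.]+')

# Tiny illustrative stop-word set (assumption, stands in for the full collection)
DEMO_STOPWORDS = {"the", "to", "a", "id", "cant", "we", "cloudfront", "s3", "from", "in"}

def clean_text(text, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip tags/URLs/symbols, then drop digit-only words and stop words."""
    text = str(text).lower()
    # Word-by-word substitutions, in the same order as the cells above
    for pattern, repl in ((REMOVE_HTML_RE, ' '), (REMOVE_HTTP_RE, ' '),
                          (REPLACE_BY_SPACE_RE, ' '), (BAD_SYMBOLS_RE, '')):
        text = " ".join(pattern.sub(repl, tok) for tok in text.split())
    # Strip leading/trailing dots, then hyphens, then underscores
    for ch in '.-_':
        text = " ".join(tok.strip(ch) for tok in text.split())
    tokens = [tok for tok in text.split() if not tok.isdigit()]
    return " ".join(tok for tok in tokens if tok not in stopwords and len(tok) > 1)

# Made-up sample; example.com is a placeholder URL
sample = "I'd love custom headers in CloudFront, e.g. X-Frame-Options (see https://example.com/x)."
print(clean_text(sample))
```

On the notebook's frame this would be called as `df['description'].apply(clean_text)` with the full stop-word collection passed as a set. A set is checked by hash lookup, whereas a list is scanned from the start for every word.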

Results after cleaning data:

In [17]:
df.head()
Out[17]:
id label description
0 5829.0 Amazon CloudFront chunked encoding resumable downloads supports something coming soon
1 5829.0 Amazon CloudFront keith4pluralsight supports resumable downloads chunked encoding time plans support chunked encoding certainly interested community opinion.thanks ...
2 5829.0 Amazon CloudFront chunked encoding especially usedul edge low latency critical want flush html client soon possible allow start parsing prefetching critical resourc...
3 5828.0 Amazon CloudFront debug error accessing resources simple means debugging configuration supposed provide access contents bucket trying find consistent means creating...
4 5828.0 Amazon CloudFront tried using cloudtrail contains sorts events search events username access bucket name etc seeing specific events try setting logging relevant buc...

Top 30 words + frequency of each:

In [18]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[18]:
ms              11627
origin           6276
distribution     5979
using            5970
server           5625
file             5418
get              5190
use              4891
request          4874
bucket           4539
files            4227
content          4041
like             3749
time             3608
error            3527
access           3437
see              3307
one              3285
issue            3271
set              3076
video            3010
problem          2970
streaming        2940
also             2889
cache            2830
edge             2827
new              2704
object           2602
need             2566
help             2550
dtype: int64
In [19]:
print("There are a total of", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words after cleaning.")
There are a total of 825876 words after cleaning.
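
One detail about this count: `str.split(' ')` returns `['']` for an empty string and yields empty strings between consecutive spaces, so any row emptied by cleaning still adds 1 to the total. A sketch of the difference, using made-up rows:

```python
# Made-up rows: a normal row, an empty row, and a row with a double space
rows = ["chunked encoding", "", "love  headers"]

count_single_space = sum(len(r.split(' ')) for r in rows)  # same splitting as the cell above
count_whitespace = sum(len(r.split()) for r in rows)       # empty strings are not counted
print(count_single_space, count_whitespace)
```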

(C) Write to CleanText.csv

In [20]:
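# Append mode ('a'): running the notebook on another input file adds its rows to the same CleanText.csv
with open('C:\\Users\\Aruna\\Documents\\ACMS-IID\\input\\CleanText.csv', 'a', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # writer.writerow(['id', 'label', 'description'])  # header row, currently disabled
    for i in range(len(df['description'])):
        # Skip rows whose cleaned description is empty or a single character
        if len(df['description'][i]) > 1:
            writer.writerow([df['id'][i], df['label'][i], df['description'][i]])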
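
Since the file is opened in append mode, one way to get the header written exactly once is to check whether the file is new or empty before writing. The sketch below assumes a helper named `append_rows` and uses a temporary directory with made-up rows; it is not the notebook's own code.

```python
import csv
import os
import tempfile

def append_rows(path, rows, header=('id', 'label', 'description')):
    """Append rows to path, writing the header only when the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        if write_header:
            writer.writerow(header)
        for row in rows:
            # Skip rows whose cleaned description is empty or a single character
            if len(row[2]) > 1:
                writer.writerow(row)

# Demo in a temporary directory (hypothetical path and rows)
demo_path = os.path.join(tempfile.mkdtemp(), 'CleanText.csv')
append_rows(demo_path, [(5829.0, 'Amazon CloudFront', 'chunked encoding')])
append_rows(demo_path, [(5828.0, 'Amazon CloudFront', 'debug error'),
                        (5828.0, 'Amazon CloudFront', '')])

with open(demo_path, encoding='utf-8', newline='') as f:
    written = list(csv.reader(f))
print(written)
```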

(D) Generate the word cloud

In [21]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 20, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[21]:
(-0.5, 399.5, 199.5, -0.5)
In [22]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 50, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[22]:
(-0.5, 399.5, 199.5, -0.5)
In [23]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 100, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[23]:
(-0.5, 399.5, 199.5, -0.5)
In [24]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 500, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[24]:
(-0.5, 399.5, 199.5, -0.5)
In [25]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 1000, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[25]:
(-0.5, 399.5, 199.5, -0.5)